Chromosome-scale genome assembly and annotation of Cotoneaster glaucophyllus

Cotoneaster glaucophyllus is a semi-evergreen plant that blossoms in late summer, producing dense, attractive, fragrant white flowers with significant ornamental and ecological value. Here, a chromosome-scale genome assembly was obtained by integrating PacBio and Illumina sequencing data with the aid of Hi-C technology. The genome assembly was 563.3 Mb in length, with contig N50 and scaffold N50 values of ~6 Mb and ~31 Mb, respectively. Most (95.59%) of the sequences were anchored onto 17 pseudochromosomes (538.4 Mb). We predicted 35,856 protein-coding genes, 1,401 miRNAs, 655 tRNAs, 425 rRNAs, and 795 snRNAs. The functions of 34,967 genes (97.52%) were predicted. The availability of this chromosome-level genome will provide valuable resources for molecular studies of this species, facilitating future research on speciation, functional genomics, and comparative genomics within the Rosaceae family.


Background & Summary
Species of the genus Cotoneaster Medic.belong to the Malinae subtribe of the Rosaceae family 1 , and are primarily distributed in continental Eurasia, with a remarkable species diversity in the biodiversity hotspots of the Himalayas and the Hengduan Mountains (HDM) 2 .Taxonomic difficulties for this genus have been caused by various evolutionary events, including hybridization, polyploidization, and apomixis.A comprehensive phylogenetic analysis of this genus has been conducted using genome-skimming data, but with the genome of Eriobotrya japonica serving as the mapping reference 3 , which might introduce mapping errors, incorrect alignments, difficulties in identifying orthologous genes, and genome annotation issues.
Based on morphological characteristics and molecular evidences, two subgenera or sections have been proposed: Cotoneaster, characterized by predominantly red or pink flowers with erect petals, and Chaenopetalum, noted for its primarily white flowers with spreading petals [2][3][4][5] .Notably, only approximately 10% of Cotoneaster species are diploid 2 .Cotoneaster glaucophyllus, as a representative member of the Chaenopetalum subgenus and a diploid species, has a distinct distribution in the southeastern of Hengduan Mountains and on the Yunnan-Guizhou Plateau.It is a semi-evergreen shrub that blossoms in late summer, exhibiting dense, showy, fragrant white flowers, and bears long-lasting fruits in early winter, potentially making it an important ornamental plants 2,6,7 .With continuous advancements in sequencing technology, abundant genome resources for numerous Rosaceae species have been extensively documented [8][9][10][11][12] .However, the lack of whole-genome sequencing in Cotoneaster species has been a significant obstacle in further understanding the gene functions, evolutionary history, and conservation of this complicated genus (up to 370 species).
Using the Pacific Biosciences (PacBio) platform, we generated ~117 Gb of DNA continuous long reads (CLRs) and obtained ~48 Gb of full-length transcriptome sequences.Additionally, we sequenced ~104 Gb of DNA reads and ~10 Gb of RNA reads (2 × 150 bp) as well as ~62 Gb of high-throughput chromosome conformation capture (Hi-C) reads based on the Illumina HiSeq platform.With the aid of Hi-C technologies, we finally provided a high-quality genome sequence for the diploid species (2n = 2x = 34) of C. glaucophyllus (Fig. 1).
DNA and RNA extraction and genome sequencing.Total DNA was extracted from fresh leaves using the Plant Genomic DNA Kit (DP305, Tiangen Biotech Co., Ltd., Beijing, China).The qualified DNAs were used to construct libraries intended for single molecular real-time (SMRT) sequencing using the Pacific Biosciences system (Menlo Park, CA, USA), Illumina sequencing, and Hi-C sequencing.The 20 kb library was prepared following the manufacturer's protocol 13 .For the Illumina DNA paired-end library, the NEBNext ® UltraTM DNA Library Prep Kit was utilized according to the provided instructions, with an insert size of 350 bp.The Hi-C library was prepared following standard procedures 14 .
Samples including fresh leaves, flowers, fruits, roots, and stems were pooled for total RNA extraction using the TIANGEN RNAPrep Pure Plant kit (DP432, Tiangen Biotech Co. Ltd., Beijing, China).Subsequently, the qualified RNAs were utilized for synthesizing full-length cDNAs with the SMRTer PCR cDNA Synthesis Kit (Biomarker, Beijing).Full-length transcriptome sequencing was performed on the PacBio Sequel platform.Additionally, short RNA-Seq reads (2 × 150 bp) specifically from leaf samples were generated and processed 15 to facilitate the correction of the long-read RNA sequencing data and genome annotation.
PacBio long-read sequencing was performed using the PacBio Sequel system, while high throughput sequencing (2 × 150 bp) was carried out using an Illumina HiSeq sequencer.Both sequencing processes were conducted at Novogene Bioinformatics Technology Co., Ltd.(Beijing, China).
Pre-estimation of genomic characteristics.The generated Illumina sequencing data were primarily processed using the NGSQC Toolkit v2.3.3 16 .This processing was involved in discarding reads that had adaptor contamination, reads with more than 10% unknown nucleotides (N), and paired reads that contained over 20% bases with a quality score of less than 5 in either read.Then, we performed a genome survey using Jellyfish v.2.2.7 17 with the default setting of k-mer = 17 (Fig. 2).Based on a kmer-based statistical approach, GenomeScope v.2.0 18 was used to estimate genome heterozygosity, repeat content, and size.To initially assess the genomic complexity, we employed SOAPdenovo v.2.0.4 19 to generate a de novo draft assembly using a k-mer length of 41.The assembled contigs were then utilized to calculate the guanine-cytosine (GC) content.The estimated genome size was determined to be 625.87Mb, with a heterozygosity rate of 0.55% and a repeat sequence proportion of 54.97%.Moreover, the estimated GC content was 38.65%.

Genome assembly and quality assessment.
The FALCON assembler 20 was initially employed to perform self-correction of PacBio subreads.Subsequently, preassembled reads were assembled using the overlap-layout-consensus (OLC) algorithm, resulting in consensus contigs.To enhance the accuracy of the results, high-quality contigs were further corrected using Illumina short DNA reads through Pilon 21 .Leveraging the clean Hi-C data, the LACHESIS tool 22 was utilized to scaffold the assembly, ultimately yielding a chromosome-level assembly.The de novo genome assembly was 563.3 Mb in length, with a contig N50 of ~6 Mb and a scaffold N50 of ~31 Mb (Table 1).
To comprehensively evaluate the reliability of the assembly, multiple assessments were performed in addition to considering the contig/scaffold N50 length.First, the integrity of the assembly was assessed by mapping the assembled genome to the BUSCO (Benchmarking Universal Single-Copy Orthologs) database v2.0 23 (BUSCO, RRID: SCR 015008) and the CEGMA v2.5 24 (Core Eukaryotic Genes Mapping Approach, RRID: SCR 015055).The BUSCO database contains 1,440 conserved core genes in terrestrial plants, while CEGMA includes a subset of the 248 most highly-conserved Core Eukaryotic Genes (CEGs).Second, the consistency between the assembly and paired-end Illumina short reads was evaluated by calculating the mapping and coverage rates.The Burrows-Wheeler Aligner (BWA) v0.7.15 25 was used to align the 150 bp short reads to the assembly.Thirdly, assembly accuracy was assessed by conducting SNP calling using SAMtools v1.9 26 and BCFtools v1.9 (https:// github.com/samtools/bcftools)based on the above mapping results.The rates of homozygous and heterozygous single-nucleotide polymorphisms (SNPs) were also determined.

Pseudo-chromosome ID Cluster number (contigs) Sequences length (bp) GC content (%)
Gene functions were assigned by aligning the protein sequences to Swiss-Prot 52 using Blastp 53 , with a threshold of E-value ≤ 1e−5, and the best match was considered.Motifs and domains were annotated using InterProScan v5.31 54 , which involved searching against publicly available databases, including ProDom 55 , PRINTS 56 , Pfam 57 , SMART 58 , PANTHER 59 , and PROSITE 60 .Gene Ontology (GO) IDs were assigned to each gene based on the corresponding InterPro entry.Protein function predictions were made by transferring annotations from the closest BLAST hit (E-value ≤ 1e−5) in the SwissProt database 51 and DIAMOND v0.8.22 61 hit (E-value ≤ 1e−5) in the NR database.Additionally, we mapped the gene set to a KEGG pathway and identified the best match for each gene.The functions of 34,967 genes (97.52%) were predicted (Table 6).Comparative analysis of gene elements among Rosaceae-related species revealed that the genome assembly of Cotoneaster glaucophyllus exhibits a shorter average exon length (229.78 bp) and a longer average intron length (508.51 bp) than those of other considered species (Fig. 4, Table 7).

Data Records
The raw data of Hi-C short reads, Illumina DNA short reads, PacBio DNA long reads, RNA short reads, and PacBio RNA long reads have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive database with accession numbers SRR25933879 62 , SRR25933878 63 , SRR25933877 64 , SRR25933876 65 , and SRR25933875 66 under BioProject accession number PRJNA1012579.The genome assembly has been deposited at GenBank under the WGS accession JAVVNS000000000 67 .Additionally, the genome assembly, predicted transcripts and protein sequences, functional annotation files (gff files), and NR and KEGG annotation files have been deposited in Figshare 68 .

technical Validation
Multiple parameters were employed to assess the quality of the genome assembly.The BUSCO evaluation indicated that among the Eukaryota BUSCO genes, 62.9% (906) of the sequences were identified as complete and single-copy, while 30.3% (436) were complete but duplicated.Additionally, 1.1% ( 16) of the sequences   were fragmented, and 5.7% (82) were found to be missing.Analysis of the 248 most highly-conserved Core Eukaryotic Genes (CEGs) revealed the presence of 238 complete genes (95.97%) and 6 incomplete genes (2.42%).The evaluation of the consistency between the assembly and paired-end DNA short reads indicated that the overall mapping and coverage rates were 94.61% and 99.99%, respectively.The rates of homozygous and heterozygous single-nucleotide polymorphisms (SNPs) were 0.001413% (798) and 0.288695% (163,081).Furthermore, we mapped the DNA continuous long reads (CLRs) to the genome using the minimap2 69 , and calculated the sequencing depth and coverage for each pseudo-chromosome (Table 2).These results collectively demonstrate a genome assembly of high quality, completeness, and accuracy.

Fig. 2
Fig. 2 Frequency distribution of depth and K-mer numbers (A) and frequency distribution of depth and K-mer types (B).

Fig. 4
Fig.4 Comparative analysis of gene elements among Rosaceae-related species.

Table 1 .
Statistics of genome assembly.
49ing PASA49(Program to Assemble Spliced Alignment) terminal exon support and including masked transposable elements as aninput for gene prediction.Furthermore, gene structure and gene elements, including average transcript length, average CDS length, and average exon and intron length, were compared among Cotoneaster glaucophyllus and the above six related species.

Table 3 .
Summary of interspersed repetitive sequences.

Table 4 .
Statistics of gene structure prediction.Note: The asterisk (*) indicates the inclusion of UTR regions.

Table 5 .
Statistics of noncoding genes.

Table 6 .
Summary of gene function annotations.

Table 7 .
Comparative analysis of gene elements among Rosaceae-related species.